Instance Evacuation prevent baseline re-computation after dropped #2740

zpinto · 2024-01-23T02:54:15Z

Issues

Instance Evacuation is causing shuffling after dropping an evacuated node. This is a result of new baseline being computed.

Description

We are now controlling the filtering of certain "Assignable" instances through the BaseControllerDataProvider. This will help centralize the logic and decouple it with the rebalancers. This PR migrates EVACUATE to this approach. Putting this logic here, allows methods like getAssignableInstanceConfigMap to respect the "Assignable" states. When the return value for this changes, a new baseline is computed. If it differs, which it will for evacuate, it will cause shuffling.

When the baseline is calculated, whether it is enabled, live, or set to EVACUATE is not respected. The assignment assumes the "happy path" where all of these instances are assignable or will be assignable. When and EVACUATE node is dropped, it will force a differing baseline to be calculated. This will cause shuffling after the evacuation is complete.

To fix this, setting and InstanceOperation to EVACUATE will ensure that it does not get returned in getAssignable*. This forces the baseline to be recomputed without considering the EVACUATE instance. When the instance is dropped, the arguments to baseline computation are the same as the previous computation and there is no change.

To maintain N -> N + 1 -> N behavior, we change the computeBestPossiblePartition state to pass all liveInstances so that the replicas on the EVACUATE instance can still be part of the stateMap until the new replica is created.

Tests

Update testEvacuate to ensure no additional shuffling after the evacuated instance is dropped.

Changes that Break Backward Compatibility (Optional)

Removed some public methods; however, they were relatively new and would likely cause more confusion if they were left. If objections, I can deprecate them instead. Also, I don't think any were part of an open-source release.

Documentation (Optional)

NA

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

…re that baseline only changes once when EVACUATE IstanceOperation is set and not again after the instance is dropped from the cluster.

desaikomal

overall change looks good. i just want to make sure we are not duplicating information

...core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java

zpinto · 2024-01-29T23:00:52Z

Failed one flaky test: TestClusterAccessor.testGetClusters

TestClusterAccessor.testGetClusters:102 clusters from response: [TaskTestCluster, TestCluster_0, TestCluster_1, superCluster, TestCluster_2, TestCluster_3, TestCluster_4, StoppableTestCluster, StoppableTestCluster2] vs clusters actually: [TestCluster_0, TaskTestCluster, TestCluster_1, superCluster, TestCluster_2, TestCluster_3, TestCluster_4, StoppableTestCluster, StoppableTestCluster2]: Lists differ at element [0]: TestCluster_0 != TaskTestCluster expected:<TestCluster_0> but was:<TaskTestCluster>

The sets of clusters are actually the same but they are being compared as lists for some reason. They have different orders. Will raise a separate PR to fix this.

desaikomal

thanks for explaining offline, looks good.

zpinto · 2024-01-29T23:48:42Z

This PR is ready to be merged!

Final commit message:
Instance Evacuation logic is moved to BaseControllerDataProvider cache and will only cause baseline recalculation when the evacuate is triggered instead of evacuating the node and then recalculating baseline after node is dropped causing more shuffling. Also fix getAllLiveInstances to exclude TimedOutForMaintenance nodes.

xyuanlu approved these changes Jan 25, 2024

View reviewed changes

zpinto added 2 commits January 26, 2024 15:16

Refactor evacuation logic into the BaseControllerDataProvider to ensu…

7085865

…re that baseline only changes once when EVACUATE IstanceOperation is set and not again after the instance is dropped from the cluster.

Fix node timeout in MM.

e802c3b

zpinto force-pushed the zpinto/evacuate_only_affects_baseline_once branch from 0bce512 to e802c3b Compare January 26, 2024 23:22

github-actions bot added the CheckedAndApproved label Jan 26, 2024

Fix updateInstance in TestPerInstanceAccessor.

135b2fc

desaikomal reviewed Jan 29, 2024

View reviewed changes

...core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java Show resolved Hide resolved

...core/src/main/java/org/apache/helix/controller/dataproviders/BaseControllerDataProvider.java Show resolved Hide resolved

desaikomal approved these changes Jan 29, 2024

View reviewed changes

xyuanlu merged commit 054d77a into apache:master Jan 30, 2024
1 of 2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instance Evacuation prevent baseline re-computation after dropped #2740

Instance Evacuation prevent baseline re-computation after dropped #2740

zpinto commented Jan 23, 2024 •

edited

Loading

desaikomal left a comment

zpinto commented Jan 29, 2024

desaikomal left a comment

zpinto commented Jan 29, 2024

Instance Evacuation prevent baseline re-computation after dropped #2740

Instance Evacuation prevent baseline re-computation after dropped #2740

Conversation

zpinto commented Jan 23, 2024 • edited Loading

Issues

Description

Tests

Changes that Break Backward Compatibility (Optional)

Documentation (Optional)

Commits

Code Quality

desaikomal left a comment

Choose a reason for hiding this comment

zpinto commented Jan 29, 2024

desaikomal left a comment

Choose a reason for hiding this comment

zpinto commented Jan 29, 2024

zpinto commented Jan 23, 2024 •

edited

Loading